[amdgpu] Part3 update runtime module #6486

galeselee · 2022-10-31T17:20:43Z

Brief Summary

This is a special part of the Taichi runtime module for the AMDGPU backend. Taichi's runtime module uses clang++ to generate LLVM IR, and the LLVM IR required for AMDGPU differs from the CPU-generated LLVM IR in memory allocation. The following is an example.

C/C++ code
void func(int *a, int *b) {
    *a = *b;
}
x86_64 backend LLVM IR
define dso_local void @cpu_func(i32* %0, i32* %1) #2 {
  %3 = alloca i32*, align 8
  %4 = alloca i32*, align 8
  store i32* %0, i32** %3, align 8
  store i32* %1, i32** %4, align 8
  %5 = load i32*, i32** %4, align 8
  %6 = load i32, i32* %5, align 4
  %7 = load i32*, i32** %3, align 8
  store i32 %6, i32* %7, align 4
  ret void
}
__global__ function on AMDGPU
define protected amdgpu_kernel void @global_func(i32 addrspace(1)* %0, i32 addrspace(1)* %1) #4 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = alloca i32*, align 8, addrspace(5)
  %6 = alloca i32*, align 8, addrspace(5)
  %7 = addrspacecast i32* addrspace(5)* %3 to i32**
  %8 = addrspacecast i32* addrspace(5)* %4 to i32**
  %9 = addrspacecast i32* addrspace(5)* %5 to i32**
  %10 = addrspacecast i32* addrspace(5)* %6 to i32**
  %11 = addrspacecast i32 addrspace(1)* %0 to i32*
  store i32* %11, i32** %7, align 8
  %12 = load i32*, i32** %7, align 8
  %13 = addrspacecast i32 addrspace(1)* %1 to i32*
  store i32* %13, i32** %8, align 8
  %14 = load i32*, i32** %8, align 8
  store i32* %12, i32** %9, align 8
  store i32* %14, i32** %10, align 8
  %15 = load i32*, i32** %10, align 8
  %16 = load i32, i32* %15, align 4
  %17 = load i32*, i32** %9, align 8
  store i32 %16, i32* %17, align 4
  ret void
}
__device__ function on AMDGPU
define hidden void @device_func(i32* %0, i32* %1) #2 {
  %3 = alloca i32*, align 8, addrspace(5)
  %4 = alloca i32*, align 8, addrspace(5)
  %5 = addrspacecast i32* addrspace(5)* %3 to i32**
  %6 = addrspacecast i32* addrspace(5)* %4 to i32**
  store i32* %0, i32** %5, align 8
  store i32* %1, i32** %6, align 8
  %7 = load i32*, i32** %6, align 8
  %8 = load i32, i32* %7, align 4
  %9 = load i32*, i32** %5, align 8
  store i32 %8, i32* %9, align 4
  ret void
}

The differences are in the alloca instructions, specifically in their address space (for AMDGPU, https://llvm.org/docs/AMDGPUUsage.html#address-spaces will be helpful). I have not found documentation describing how to write correct LLVM IR for AMDGPU, so the following is based on my observation of the LLVM IR generated by clang++/hipcc. For a __global__ function we need to handle both the function arguments and the alloca instructions (specifying the address space of each alloca and performing an addrspacecast), while for a __device__ function we do not need to handle the function arguments.
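The two rewrites described above can be sketched as plain string builders that reproduce the instruction shapes seen in the hipcc output. This is only an illustration: the helper names `private_alloca` and `cast_to_generic` and the constants are hypothetical and are not part of Taichi or LLVM, and a real pass would operate on LLVM IR objects rather than on text. The address-space numbers (1 for global, 5 for private) are taken from the LLVM AMDGPU usage guide.

```cpp
#include <string>

// Hypothetical constants for illustration; numbers follow
// https://llvm.org/docs/AMDGPUUsage.html#address-spaces
constexpr unsigned kGlobalAddrSpace = 1;   // kernel pointer arguments
constexpr unsigned kPrivateAddrSpace = 5;  // stack slots created by alloca

// Builds an alloca placed in the private address space, e.g.
//   %3 = alloca i32*, align 8, addrspace(5)
std::string private_alloca(const std::string &result,
                           const std::string &type,
                           unsigned align) {
  return result + " = alloca " + type + ", align " + std::to_string(align) +
         ", addrspace(" + std::to_string(kPrivateAddrSpace) + ")";
}

// Builds an addrspacecast from a pointer in `from_space` to a generic
// pointer, e.g.
//   %7 = addrspacecast i32* addrspace(5)* %3 to i32**
std::string cast_to_generic(const std::string &result,
                            const std::string &pointee_type,
                            unsigned from_space,
                            const std::string &value) {
  return result + " = addrspacecast " + pointee_type + " addrspace(" +
         std::to_string(from_space) + ")* " + value + " to " +
         pointee_type + "*";
}
```

For example, `cast_to_generic("%11", "i32", kGlobalAddrSpace, "%0")` yields the cast applied to the first argument of `@global_func`, which is the step a `__device__` function does not need.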

jim19930609

In general, for hardware-specific code or configurations, we should add a comment with a reference link to the specific chapter/part of the AMD GPU guide whenever we use magic strings or magic numbers. This is especially true when setting up function attributes and the like.

The previously implemented CUDA backend is full of magic values and hacks and lacks sufficient comments/explanations. Let's improve this situation starting with AMD GPU. Thanks!
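One possible way to follow this suggestion for the address-space numbers is sketched below. The enum and function names are hypothetical and do not come from the Taichi codebase. The numeric values are those listed in the address-space table of the LLVM AMDGPU usage guide, which the comment links to.

```cpp
#include <cstdint>
#include <string_view>

// AMDGPU address spaces, named instead of used as bare integers.
// Reference: https://llvm.org/docs/AMDGPUUsage.html#address-spaces
enum class AmdgpuAddrSpace : std::uint32_t {
  Generic = 0,   // flat
  Global = 1,    // global memory
  Region = 2,    // GDS
  Local = 3,     // LDS / group shared memory
  Constant = 4,  // read-only global memory
  Private = 5,   // scratch; used for alloca
};

// Converts the named address space to the integer LLVM expects.
constexpr std::uint32_t to_llvm_addrspace(AmdgpuAddrSpace space) {
  return static_cast<std::uint32_t>(space);
}

// Human-readable name, handy for log or assertion messages.
constexpr std::string_view describe(AmdgpuAddrSpace space) {
  switch (space) {
    case AmdgpuAddrSpace::Generic:  return "generic";
    case AmdgpuAddrSpace::Global:   return "global";
    case AmdgpuAddrSpace::Region:   return "region";
    case AmdgpuAddrSpace::Local:    return "local";
    case AmdgpuAddrSpace::Constant: return "constant";
    case AmdgpuAddrSpace::Private:  return "private";
  }
  return "unknown";
}
```

With names like these, a call site reads `to_llvm_addrspace(AmdgpuAddrSpace::Private)` rather than a bare `5`, and the link sits next to the definition.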

taichi/runtime/llvm/llvm_runtime_executor.cpp

taichi/runtime/llvm/llvm_context.cpp

jim19930609

LGTM!

Issue: #6434

Brief Summary

These unit tests are for #6486

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

galeselee added 4 commits October 31, 2022 21:13

alter global function's arg address

e5a32a1

add misc api

8e91537

add update addrspace module

3546946

fix scope

a0ec473

galeselee mentioned this pull request Oct 31, 2022

Add support for AMDGPU #6434

Open

add macro control

7fd4225

galeselee force-pushed the amdgpu_update_runtime_module branch from 7db2df8 to 7fd4225 Compare October 31, 2022 17:40

del extra control

82b4ccb

galeselee force-pushed the amdgpu_update_runtime_module branch from f48f3f8 to 82b4ccb Compare October 31, 2022 17:52

fix typo

d30e24b

galeselee force-pushed the amdgpu_update_runtime_module branch from 068e095 to d30e24b Compare October 31, 2022 17:56

[pre-commit.ci] auto fixes from pre-commit.com hooks

660b98d

for more information, see https://pre-commit.ci

turbo0628 requested a review from jim19930609 November 1, 2022 01:43

jim19930609 suggested changes Nov 7, 2022

galeselee closed this Nov 14, 2022

galeselee reopened this Dec 2, 2022

galeselee marked this pull request as draft December 2, 2022 07:49

update pass to llvm_pass

1e8c1f4

galeselee marked this pull request as ready for review December 22, 2022 13:20

galeselee requested a review from jim19930609 December 22, 2022 13:21

fix bug and solve conversation

325732a

galeselee force-pushed the amdgpu_update_runtime_module branch from d18335c to 325732a Compare December 22, 2022 13:28

del extra header file in llvm_context_pass.h

bd6b58d

galeselee force-pushed the amdgpu_update_runtime_module branch from 823e7a0 to bd6b58d Compare December 23, 2022 15:21

[pre-commit.ci] auto fixes from pre-commit.com hooks

c015143

for more information, see https://pre-commit.ci

jim19930609 approved these changes Dec 27, 2022

galeselee merged commit d321d12 into taichi-dev:master Dec 30, 2022

galeselee mentioned this pull request Jan 1, 2023

[amdgpu] Add convert addressspace pass related unit test #7023

Merged
